
Game of Thrones is a popular fantasy TV show based on a series of books written by George R.R. Martin. This notebook showcases an analysis and predictions of the battles in the series.
As a fan of the hit TV show, I have been amazed by the battle scenes in every season. One of my favorite battles is the “Battle of the Bastards” in Season 6, Episode 9. What I like about this battle is that Jon Snow managed to convince the Free Folk to join his army and fight against House Bolton. The episode was well directed in terms of the number of people involved and the sequencing of the battle, and there were many detailed attacks.
My notebook analyzes Chris Albon’s “The War of the Five Kings” dataset, which can be found here. It is a great collection of all of the battles in the series.
I plan on tackling three key questions from the battles dataset:
1) Which house wins the most battles in any situation?
2) What is the expected size of the defending army given the size of the attacking army?
3) What factors contribute to a battle victory?
Load packages
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import seaborn as sns
from pylab import rcParams
from collections import Counter
from time import time
from pandas_profiling import ProfileReport
from IPython.display import display
import statsmodels.api as sm
import plotly.express as px
# Import supplementary visualization code visuals.py
import visuals as vs
rcParams['figure.figsize'] = 12, 9
plt.style.use('ggplot')
# @hidden_cell
import warnings
warnings.filterwarnings("ignore")
Load dataset
battles_df = pd.read_csv('../data/battles.csv')
battles_df.head()
In reviewing the other kernels to see what has been done, one particular kernel on Kaggle pointed out a data entry mistake in the Battle of Castle Black. Having watched the TV series, I know for a fact that Mance Rayder had 100K wildlings and Stannis Baratheon had 1,240 troops, so I flipped the names in the dataset. This should be a major callout to anyone using this dataset.
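The swap can be scripted rather than edited by hand; a minimal sketch on a toy frame (the row values here are stand-ins for the real battles.csv entry):

```python
import pandas as pd

# Toy stand-in for the raw battles.csv row (values are assumptions).
df = pd.DataFrame({
    'name': ['Battle of Castle Black'],
    'attacker_king': ['Stannis Baratheon'],
    'defender_king': ['Mance Rayder'],
})

# The 100K wildling army belongs to Mance Rayder, so the attacker and
# defender names are flipped for this row; swap them back.
mask = df['name'] == 'Battle of Castle Black'
df.loc[mask, ['attacker_king', 'defender_king']] = \
    df.loc[mask, ['defender_king', 'attacker_king']].values

print(df.loc[0, 'attacker_king'])  # Mance Rayder
```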
Attacker and defender names are dropped in order to look at the dataset more clearly.
battles_df.drop(['attacker_1','attacker_2','attacker_3','attacker_4','defender_1','defender_2','defender_3','defender_4','note'],axis=1).head()
For this step, I will map attacker_outcome to a boolean flag and fill NaN with zeros for attacker_outcome, major_death, major_capture, and summer.
battles_df['attacker_outcome_flag'] = battles_df['attacker_outcome'].map({'win': 1, 'loss': 0})
battles_df['attacker_outcome_flag'] = battles_df['attacker_outcome_flag'].fillna(0)
battles_df['major_death'] = battles_df['major_death'].fillna(0)
battles_df['major_capture'] = battles_df['major_capture'].fillna(0)
battles_df['summer'] = battles_df['summer'].fillna(0)
#Check dataset
battles_df[['attacker_outcome','major_death','major_capture','summer']].head()
Whenever I start a project, I like to run the dataset through Pandas Profiler.
profile = ProfileReport(battles_df, title='Game of Thrones Battles - Pandas Profiling Report', style={'full_width':True})
profile
profile.to_file(output_file="../output/got_battles_data_profile.html")
The columns attacker_1 through attacker_4, defender_1 through defender_4, attacker_commander, and defender_commander can be used to count the number of houses and commanders involved in each battle.
First, two columns are created for the attacking and defending house counts:
battles_df['attack_houses'] = battles_df[['attacker_1','attacker_2','attacker_3','attacker_4']].notnull().sum(axis=1)
battles_df['attack_houses'] = pd.to_numeric(battles_df.attack_houses)
battles_df['defender_houses'] = battles_df[['defender_1','defender_2','defender_3','defender_4']].notnull().sum(axis=1)
battles_df['defender_houses'] = pd.to_numeric(battles_df.defender_houses)
# Check data
battles_df[['attacker_1','attacker_2','attacker_3','attacker_4','attack_houses','defender_1','defender_2','defender_3','defender_4','defender_houses']].sort_values(by=['attack_houses','defender_houses'],ascending=[False,False]).head()
Count occurrences of attacker_commander and defender_commander
battles_df['attacker_commander'].str.split(',', expand=True).head()
battles_df['attacker_commander_count'] = battles_df['attacker_commander'].str.split(',', expand=True).notnull().sum(axis=1)
battles_df[['attacker_commander','attacker_commander_count']].head()
battles_df['defender_commander'].str.split(',', expand=True).head()
battles_df['defender_commander_count'] = battles_df['defender_commander'].str.split(',', expand=True).notnull().sum(axis=1)
battles_df[['defender_commander','defender_commander_count']].head()
Drop columns with missing data
battles_df = battles_df.drop(columns = ['battle_number','attacker_2','attacker_3','attacker_4','defender_2','defender_3','defender_4','note'])
battles_df.head()
Create battle_size for the total number of people involved in a battle.
battles_df['battle_size'] = battles_df['attacker_size'] + battles_df['defender_size']
battles_df[['attacker_size','defender_size','battle_size']].head()
Plot correlation
corr_plot = battles_df.corr(method='pearson').style.set_caption('Correlation for Game of Thrones Battles').background_gradient(cmap='coolwarm').set_precision(4)
corr_plot
def clean_battle_data(df):
    df['attacker_outcome_flag'] = df['attacker_outcome'].map({'win': 1, 'loss': 0})
    # Fill NaN with zero
    df['attacker_outcome_flag'] = df['attacker_outcome_flag'].fillna(0)
    df['major_death'] = df['major_death'].fillna(0)
    df['major_capture'] = df['major_capture'].fillna(0)
    df['summer'] = df['summer'].fillna(0)
    df['attacker_size'] = df['attacker_size'].fillna(0)
    df['defender_size'] = df['defender_size'].fillna(0)
    # Count the number of houses involved in the battle
    df['attack_houses'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].notnull().sum(axis=1)
    df['attack_houses'] = pd.to_numeric(df.attack_houses)
    df['defender_houses'] = df[['defender_1','defender_2','defender_3','defender_4']].notnull().sum(axis=1)
    df['defender_houses'] = pd.to_numeric(df.defender_houses)
    # Count attacker_commander
    df['attacker_commander_count'] = df['attacker_commander'].str.split(',', expand=True).notnull().sum(axis=1)
    # Count defender_commander
    df['defender_commander_count'] = df['defender_commander'].str.split(',', expand=True).notnull().sum(axis=1)
    # Drop columns with missing data
    df = df.drop(columns = ['battle_number','attacker_2','attacker_3','attacker_4','defender_2','defender_3','defender_4','note'])
    # Create battle_size column
    df['battle_size'] = df['attacker_size'] + df['defender_size']
    df['battle_size'] = df['battle_size'].fillna(0)
    return df
Test function
battles_df = pd.read_csv('../data/battles.csv')
battles_df = clean_battle_data(battles_df)
battles_df.head()
Export clean data
battles_df.to_csv('../data/battles_clean.csv', index = False)

profile = ProfileReport(battles_df, title='Game of Thrones Battles - Pandas Profiling Report', style={'full_width':True})
profile.to_file(output_file="../output/got_battles_data_profile.html")
profile
attacker_outcome
32 battles out of 38 battles were won (84.2%).
attacker_king
On the offense, Joffrey/Tommen Baratheon was the attacking king 36.8% of the time (14 battles), while Mance Rayder was the attacking king only once (5.3%).
defender_king
On the defense, Robb Stark was attacked 36.8% of the time (14 battles), while Joffrey/Tommen Baratheon was second (34.2%, or 13 battles). Renly Baratheon defended only once (2.6%).
battle_type
We can see that the most common battle_type is pitched battle, appearing 36.8% of the time (14 battles), while razing appeared only 5.3% of the time (2 battles).
region
Most of the battles were fought in The Riverlands (44.7%, or 17 battles), while the second-most were fought in The North (26.3%, or 10 battles). There was only one battle Beyond The Wall (2.6%).
summer
Most of the battles were fought in the summer (26, or 68.4%), while the remaining were fought in the winter (12 battles, or 31.6%).
year
The majority of the battles were fought in the year 299 (52.6%, or 20 battles), followed by the year 300 (28.9%, or 11 battles). The remaining year, 298, had only 7 battles (18.4%).
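These percentages come straight out of the profiling report, but they can also be reproduced with value_counts; a small sketch on a toy year column with the same counts (20, 11, and 7 battles):

```python
import pandas as pd

# Toy stand-in for battles_df['year'] with the report's counts.
year = pd.Series([299] * 20 + [300] * 11 + [298] * 7)

counts = year.value_counts()                                  # battles per year
shares = (year.value_counts(normalize=True) * 100).round(1)   # percentages

print(counts[299], shares[299])  # 20 52.6
```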
We will examine using multiple variables to see how they play together.
df_grouped = battles_df.groupby(by=['attacker_king']).agg(
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    attacker_size_std = ('attacker_size', 'std'),
    defender_size_mean = ('defender_size','mean'),
    defender_size_std = ('defender_size','std')).reset_index().sort_values(by = 'attacker_outcome_flag_count', ascending = False)
df_grouped['attacker_outcome_loss'] = df_grouped['attacker_outcome_flag_count'] - df_grouped['attacker_outcome_wins']
df_grouped['attacker_outcome_wins_pct'] = (df_grouped['attacker_outcome_wins']/df_grouped['attacker_outcome_flag_count']) * 100
df_grouped['attacker_outcome_loss_pct'] = 100 - df_grouped['attacker_outcome_wins_pct']
df_grouped
# Group data by attacker_king and calculate win, loss, win percentage, loss percentage, attacker size mean, defender size mean
df_grouped[['attacker_king','attacker_outcome_wins','attacker_outcome_loss','attacker_outcome_wins_pct',
'attacker_outcome_loss_pct','attacker_size_mean','defender_size_mean']].sort_values(
by='attacker_outcome_wins_pct',ascending=False).round(1).rename(
columns={"attacker_king": "Attacker King", "attacker_outcome_wins": "Wins",
"attacker_outcome_loss": "Loss", "attacker_outcome_wins_pct":"Win Percentage",
"attacker_outcome_loss_pct" : "Loss Percentage",
"attacker_size_mean":"Attacker Size Mean",
"defender_size_mean":"Defender Size Mean"})
# Plot barplot by attacker_king with attacker_outcome_wins_pct and attacker_outcome_loss_pct
df_grouped[['attacker_king','attacker_outcome_wins_pct','attacker_outcome_loss_pct']].plot.bar(x='attacker_king')
The Greyjoys have won all seven of their battles as the attacking king with a remarkably small average army size of 183, while Mance Rayder lost his only battle with 100,000 men attacking a defending army of only 1,240! It looks like a small yet nimble army can definitely win a battle if they don’t have anyone to fight against!
Joffrey/Tommen Baratheon won most of their battles (13, or 92.9%) with an average attacking army size of 4,330 against an average defending army of 2,559.
Robb Stark won 80% of his battles (8 out of 10) with an average attacking army size of 4,122 against an average defending army size of 4,963.
Lastly, Stannis Baratheon won 50% of his battles (2 out of 4) with an average attacking army size of 8,875 against an average defending army size of 8,862. What is interesting is that the attacking and defending army sizes are very similar.
# Plot barplot by attacker_king with attacker_size_mean and defender_size_mean
df_grouped[['attacker_king','attacker_size_mean','defender_size_mean']].plot.bar(x='attacker_king')
# Remove Mance Rayder as it is an outlier
# Plot barplot by attacker_king with attacker_size_mean and defender_size_mean
df_grouped[['attacker_king','attacker_size_mean','defender_size_mean']][df_grouped.attacker_king != 'Mance Rayder'].plot.bar(x='attacker_king')
Analyze data broken by the year
df_year = battles_df.groupby(by=['year']).agg(
    battles = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'year', ascending = True)
df_year.plot.bar(x='year')
display(df_year)
sns.countplot(x='region',hue='attacker_king', data = battles_df)
df_region = battles_df.groupby(by=['region']).agg(
    battles_count = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'battles_count', ascending = False)
df_region.plot.bar(x='region')
display(df_region)
Attacker Win/Loss Percentage — Battle Type
# count battles by battle_type
pd.value_counts(battles_df['battle_type']).plot.bar()
There are four types of battles:
Pitched battle (14 battles)
Siege (12 battles)
Ambush (10 battles)
Razing (2 battles)
Pitched battles, sieges, and ambushes are the common battles that the houses face. Only on rare occasions do houses face a razing, but when they do, there are no major deaths or major captures.
df_battle_type = battles_df.groupby(by=['battle_type']).agg(
    battles_count = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'battles_count', ascending = False)
df_battle_type.plot.bar(x='battle_type')
display(df_battle_type)
# Count battles by region
pd.value_counts(battles_df['region']).plot.bar()
# Count battles by attacker_1
pd.value_counts(battles_df['attacker_1']).plot.bar()
# Count battles by defender_1
pd.value_counts(battles_df['defender_1']).plot.bar()
# Count battles by summer
pd.value_counts(battles_df['summer']).plot.bar()
# Count battles by attacker_size
battles_df['attacker_size'].hist(bins=20)
# Count battles by defender_size
battles_df['defender_size'].hist(bins=20)
# Count battles by attack_houses
battles_df['attack_houses'].hist(bins=10)
# Count battles by defender_houses
battles_df['defender_houses'].hist(bins=10)
# Count battles by attacker_commander_count
battles_df['attacker_commander_count'].hist(bins=10)
# Count battles by defender_commander_count
battles_df['defender_commander_count'].hist(bins=10)
Aggregate by battle_type: count of attacker_outcome_flag, sum of attacker_outcome_flag, and mean of attacker_size and defender_size.
df_battle_type = battles_df.groupby(by=['battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)
df_battle_type['attacker_outcome_loss'] = df_battle_type['attacker_outcome_flag_count'] - df_battle_type['attacker_outcome_wins']
df_battle_type['attacker_outcome_wins_pct'] = (df_battle_type['attacker_outcome_wins']/df_battle_type['attacker_outcome_flag_count']) * 100
df_battle_type['attacker_outcome_loss_pct'] = 100 - df_battle_type['attacker_outcome_wins_pct']
df_battle_type
# Plot attacker_outcome_wins_pct by battle_type
sns.catplot(x="battle_type", y="attacker_outcome_wins_pct",
            aspect=0.8,
            kind="bar", data=df_battle_type)
When kings attack, pitched battles are won 71.4% of the time and sieges 83.3% of the time. Ambushes and razings are won 100% of the time by the attacking kings.
Aggregate by summer and battle_type: count of attacker_outcome_flag, sum of attacker_outcome_flag, and mean of attacker_size and defender_size.
df_battle_type_summer = battles_df.groupby(by=['summer','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)
df_battle_type_summer['attacker_outcome_loss'] = df_battle_type_summer['attacker_outcome_flag_count'] - df_battle_type_summer['attacker_outcome_wins']
df_battle_type_summer['attacker_outcome_wins_pct'] = (df_battle_type_summer['attacker_outcome_wins']/df_battle_type_summer['attacker_outcome_flag_count']) * 100
df_battle_type_summer['attacker_outcome_loss_pct'] = 100 - df_battle_type_summer['attacker_outcome_wins_pct']
df_battle_type_summer
# Plot attacker_outcome_wins_pct by battle_type, split by summer
sns.catplot(x="battle_type", y="attacker_outcome_wins_pct",
            col="summer", aspect=1,
            kind="bar", data=df_battle_type_summer)
During the winter, pitched battles and razings are won 100% of the time by the attacking king, while during the summer, only ambushes and sieges are won 100% of the time. There are no ambushes during the winter, nor are there razings in the summertime.
In the winter, sieges are won 71.4% of the time while in the summer, pitched battles are won 63.6% of the time.
df_battle_attacker_king = battles_df.groupby(by=['attacker_king','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)
df_battle_attacker_king['attacker_outcome_loss'] = df_battle_attacker_king['attacker_outcome_flag_count'] - df_battle_attacker_king['attacker_outcome_wins']
df_battle_attacker_king['attacker_outcome_wins_pct'] = (df_battle_attacker_king['attacker_outcome_wins']/df_battle_attacker_king['attacker_outcome_flag_count']) * 100
df_battle_attacker_king['attacker_outcome_loss_pct'] = 100 - df_battle_attacker_king['attacker_outcome_wins_pct']
df_battle_attacker_king.sort_values('attacker_king', ascending = True)
chart = sns.catplot(x="attacker_king", y="attacker_outcome_wins_pct",
                    col="battle_type", aspect=1,
                    kind="bar", data=df_battle_attacker_king)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
When these battles are broken out by attacking king against win percentage, we see that the Greyjoys have won every type of battle.
The Starks do not do well in pitched battles, losing 2/3 of the time as attackers. However, they win all of their sieges and ambushes when attacking.
Meanwhile, Stannis Baratheon won 50% of the pitched battles and sieges.
Lastly, Joffrey/Tommen Baratheon won 83.3% of pitched battles but won all of the sieges and ambushes.
df_battle_defender_king = battles_df.groupby(by=['defender_king','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)
df_battle_defender_king['attacker_outcome_loss'] = df_battle_defender_king['attacker_outcome_flag_count'] - df_battle_defender_king['attacker_outcome_wins']
df_battle_defender_king['attacker_outcome_wins_pct'] = (df_battle_defender_king['attacker_outcome_wins']/df_battle_defender_king['attacker_outcome_flag_count']) * 100
df_battle_defender_king['attacker_outcome_loss_pct'] = 100 - df_battle_defender_king['attacker_outcome_wins_pct']
df_battle_defender_king
chart = sns.catplot(x="defender_king", y="attacker_outcome_loss_pct",
                    col="battle_type", aspect=1,
                    kind="bar", data=df_battle_defender_king)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
When these battles are broken out by defending king against the attacker's loss percentage, we see that Joffrey/Tommen Baratheon won 75% of the pitched battles as defenders but only survived 1/3 of the sieges.
Robb Stark won only 1/6 of the pitched battles while losing all the ambushes and sieges as a defender.
Stannis Baratheon also survived 1/3 of the sieges while losing all the pitched battles.
Aggregate by attacker_king and defender_king: count of attacker_outcome_flag, sum of attacker_outcome_flag, and mean of attacker_size and defender_size.
df_attack_defend = battles_df.groupby(by=['attacker_king','defender_king']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)
df_attack_defend['attacker_outcome_loss'] = df_attack_defend['attacker_outcome_flag_count'] - df_attack_defend['attacker_outcome_wins']
df_attack_defend['attacker_outcome_wins_pct'] = (df_attack_defend['attacker_outcome_wins']/df_attack_defend['attacker_outcome_flag_count']) * 100
df_attack_defend['attacker_outcome_loss_pct'] = 100 - df_attack_defend['attacker_outcome_wins_pct']
df_attack_defend
# Plot attacker_king and defender_king by battles_count
chart = sns.catplot(x="attacker_king", y="battles_count",
                    col="defender_king", aspect=1,
                    kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
# Plot attacker_king vs attacker_outcome_loss_pct split by defender_king
chart = sns.catplot(x="attacker_king", y="attacker_outcome_loss_pct",
                    col="defender_king", aspect=1,
                    kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
# Plot attacker_king vs attacker_outcome_wins_pct split by defender_king
chart = sns.catplot(x="attacker_king", y="attacker_outcome_wins_pct",
                    col="defender_king", aspect=1,
                    kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
# Plot attacker_king vs attacker_size and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[battles_df.attacker_king != 'Mance Rayder'], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
# Plot attacker_king vs defender_size and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="defender_size", data=battles_df[battles_df.attacker_king != 'Mance Rayder'], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
# Plot attacker_king vs attacker_size, filter to attacker_outcome == 'win' and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome == 'win')], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
# Plot attacker_king vs attacker_size, filter to attacker_outcome == 'loss' and remove Mance Rayder
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome == 'loss')], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
If these battles and kings were formatted like a sports league, it would look like the table below.
attack_df = battles_df[['attacker_king','attacker_outcome_flag']].rename(columns={"attacker_king":"king","attacker_outcome_flag":"flag"})
defend_df = battles_df[['defender_king','attacker_outcome_flag']].rename(columns={"defender_king":"king","attacker_outcome_flag":"flag"})
defend_df['flag'] = defend_df['flag'].map({1: 0, 0: 1})
df_attack_defend = attack_df.append(defend_df,ignore_index=True)
league_df = df_attack_defend.groupby(by=['king']).agg(
    battle_count = ('king','count'),
    win_count = ('flag','sum')
).reset_index().sort_values(by = 'battle_count', ascending = False)
league_df['loss_count'] = league_df['battle_count'] - league_df['win_count']
league_df['win_pct'] = ((league_df['win_count']/league_df['battle_count'])*100)
league_df['loss_pct'] = ((league_df['loss_count']/league_df['battle_count'])*100)
# round values
league_df = league_df.sort_values('win_pct',ascending=False).round(1)
league_df[['win_count','loss_count']] = league_df[['win_count','loss_count']].astype('int64')
display(league_df)
display(league_df[['king','win_pct']].plot.bar(x='king'))
Balon/Euron Greyjoy and Joffrey/Tommen Baratheon are almost neck and neck in winning percentage (63.6% vs 63.0%, respectively), while Stannis Baratheon has a win rate of 42.9% and Robb Stark has a win rate of 37.5%.
This makes Balon/Euron Greyjoy the winner in terms of winning percentage.
Even though the Greyjoys have a 100% success rate in the battles where they attack, I would not declare them a powerhouse since they have not fought against a recorded defending army (based on the lack of data), so I would disqualify them.

There are two types of models that I will build to answer two of the questions:
2) What is the expected size of the defending army given the size of the attacking army? → Linear Regression Model
3) What factors contribute to a battle victory? → Classification Model
For creating a linear regression, the attacker size will be plotted against the defender size. I want to understand the relationship between the two variables.
For this dataset, outliers and army size of 0 are removed before modeling.
# Plot attacker_size vs defender_size
sns.regplot(x='attacker_size',y='defender_size',data=battles_df)
display(battles_df[['attacker_size','defender_size']].corr())
# Plot attacker_size vs defender_size while remove 'Mance Rayder' and having an army size greater than 0
battles_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0)]
sns.regplot(x='attacker_size',y='defender_size',data=battles_df1)
display(battles_df1[['attacker_size','defender_size']].corr())
# Plot attacker_size vs defender_size using Plotly
px.scatter(battles_df1, x='attacker_size', y='defender_size',
           hover_name="name",
           hover_data=["name","year","attacker_outcome","attacker_king", "defender_king"],
           trendline="ols")
The graph shows a weak correlation of 0.44 between the army sizes and a coefficient of determination (R-squared) of 0.192, which means only 19.2% of the variance in the dependent variable is explained. There are two distinct sections in the plot, so breaking the data into groups might give a better linear relationship between the army sizes.
battles_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0)]
sns.lmplot(x='attacker_size', y='defender_size', data=battles_df1[['attacker_size','defender_size']],
           robust=True, ci=None, scatter_kws={"s": 80})
display(battles_df1[['attacker_size','defender_size']].corr())
battles_win_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome_flag == 1) & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0) & (battles_df.attacker_size < 14000) & (battles_df.defender_size < 20000)]
sns.lmplot(x='attacker_size', y='defender_size', data=battles_win_df1[['attacker_size','defender_size']],
           robust=True, ci=None, scatter_kws={"s": 80})
display(battles_win_df1[['attacker_size','defender_size']].corr())
# Split by battle_type
sns.lmplot(x='attacker_size', y='defender_size', data=battles_df1, col = "battle_type",palette="Set1")
display(battles_df1[['attacker_size','defender_size']].corr())
battles_win_df2 = battles_df1[(battles_df1['battle_type'] == 'ambush') | (battles_df1['battle_type'] =='pitched battle')]
battle_type_list = battles_win_df2['battle_type'].unique()
for i in range(len(battle_type_list)):
    print('battle_type = ' + battle_type_list[i])
    display(battles_win_df2.loc[battles_win_df2['battle_type'] == battle_type_list[i],['attacker_size','defender_size']].corr())
def scatter_plotter(df,x,y):
    ''' Creates a scatter plot of the independent and dependent variables,
    split by battle type, with an OLS trendline.
    :param df: dataframe
    :param x: independent variable
    :param y: dependent variable
    '''
    df = df[(df[x] > 0)]
    battle_type_list = df['battle_type'].unique()
    for i in range(len(battle_type_list)):
        fig = px.scatter(df[(df.battle_type == battle_type_list[i])], x=x, y=y,
                         hover_name="name",
                         hover_data=["name","year","attacker_outcome","attacker_king", "defender_king"],
                         trendline="ols")
        fig.update_layout(title= str(x) + ' vs ' + str(y) + ' - ' + str(battle_type_list[i]))
        fig.show()
scatter_plotter(battles_win_df2,'attacker_size','defender_size')
Ambush — The linear model improves significantly when the data is split by battle type. Here, the coefficient of determination (R-squared) is 0.839, meaning 83.9% of the variance in the dependent variable is explained when filtering to ambush battles. This model could be used to estimate the size of the defending army if the attacking house is involved in an ambush, though realistically a defending army would not be ready for this type of attack.
Pitched Battle — Here, the R-squared is 0.468, meaning 46.8% of the variance is explained when filtering to pitched battles. With a fairly strong correlation of 0.68, this model could be used to estimate the size of the defending army if the attacking house is involved in a pitched battle. No good model was generated for siege battles due to the lack of data.
If an army is involved in an ambush or a pitched battle, they can use the linear model to estimate the size of the opposing army but using the ambush model would be more accurate than the pitched battle model.
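As a sketch of how such a model would be used, here is a line fit on hypothetical ambush data (the real slope and intercept would come from the OLS trendline on battles_win_df2):

```python
import numpy as np

# Hypothetical ambush battles: (attacker_size, defender_size) pairs.
attacker = np.array([2000, 4000, 6000, 8000])
defender = np.array([1500, 3200, 4400, 6100])

# Fit defender_size = slope * attacker_size + intercept by least squares.
slope, intercept = np.polyfit(attacker, defender, deg=1)

# Estimate the defending army for a 5,000-strong attacking force.
estimate = slope * 5000 + intercept
print(round(estimate))  # 3800
```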

Now comes the fun part of the post. I will predict the battle outcomes using various classification models:
There are fewer than 40 records in the battles dataset, so I will have to clean the dataset as much as I can.
# Drop non-feature columns that will not be used for modeling
model_df = battles_df.drop(columns = ['name','location','attacker_commander','defender_commander','attacker_outcome','attacker_outcome_flag','battle_size'])
attacker_outcome = battles_df['attacker_outcome_flag']
model_df.dtypes
model_df[model_df.columns[0:10]].head()
model_df[model_df.columns[10:23]].head()
# Replace slashes and spaces with underscores
model_df[['attacker_king','defender_king','attacker_1','defender_1']] = model_df[['attacker_king','defender_king','attacker_1','defender_1']].replace('/','_',regex=True)
model_df[['attacker_king','defender_king','region','battle_type','attacker_1','defender_1']] = model_df[['attacker_king','defender_king','region','battle_type','attacker_1','defender_1']].replace(' ','_',regex=True)
model_df[model_df.columns[0:10]].head()
model_df[model_df.columns[10:24]].head()
# Get list of categorical variables
categorical_feature_mask = model_df.dtypes==object
categorical_cols = model_df.columns[categorical_feature_mask].tolist()
categorical_cols
# Convert categorical variables into binary variables such as Attacker King, Defender King, Attacker 1, Defender 1, battle type, and region
model_df1 = pd.get_dummies(model_df, columns=categorical_cols, prefix = categorical_cols)
model_df1.head()
model_df1.columns
defender_size and attacker_size
Log-transform attacker_size and defender_size in order to reduce the variance in the data.
# Log-transform the skewed features
skewed = ['attacker_size', 'defender_size']
features_log_transformed = pd.DataFrame(data = model_df1)
features_log_transformed[skewed] = model_df1[skewed].apply(lambda x: np.log(x + 1))
battles_df['attacker_size'].hist(bins=20)
features_log_transformed['attacker_size'].hist(bins=20)
battles_df['defender_size'].hist(bins=20)
features_log_transformed['defender_size'].hist(bins=20)
Normalize numerical features such as count of attacker houses, count of defender houses, count of attacker commander, count of defender commander, and the log transformed attacker and defender sizes from the prior step
# attack_houses,defender_houses,attacker_commander_count,defender_commander_count,attacker_size,defender_size
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['attack_houses','defender_houses','attacker_commander_count','defender_commander_count','attacker_size','defender_size']
features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])
# Show an example of a record with scaling applied
display(features_log_minmax_transform.head(n = 5))
# Export data for modeling
features_final = features_log_minmax_transform
features_final.to_csv('../data/battles_data_model.csv',index = False)
I used 75% of the data as training data (28 records) and 25% as test data (10 records) across all three models. I also applied 20-fold cross-validation, a resampling procedure for evaluating machine learning models on a limited data sample.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_final,
attacker_outcome,
random_state = 0,
test_size = 0.25)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
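A quick sketch (toy indices, not the actual model) of what `cv = 20` means on a training set of 28 records: `KFold` carves out 20 train/validation partitions, so each validation fold holds only one or two battles.

```python
import numpy as np
from sklearn.model_selection import KFold

# 28 placeholder rows standing in for the training set (assumed size from the split above)
X_toy = np.arange(28).reshape(-1, 1)

# Collect the size of each validation fold
fold_sizes = [len(val_idx) for _, val_idx in KFold(n_splits=20).split(X_toy)]
print(fold_sizes)       # eight folds of 2 followed by twelve folds of 1
print(sum(fold_sizes))  # 28 — every record is validated exactly once
```

With folds this small, individual battles can swing fold accuracies between 0% and 100%, which is worth keeping in mind when reading the standard deviations reported below.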
Logistic Regression
Random Forest
XGBoost
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
logistic_acc = accuracies.mean()
logistic_std = accuracies.std()
display(logistic_acc)
display(logistic_std)
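The confusion matrix `cm` computed above is never actually displayed; as a reading aid, here is a toy example (made-up labels, not the real predictions) showing how to interpret one: rows are true classes, columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes for illustration only
y_true_toy = ["win", "win", "loss", "win", "loss"]
y_pred_toy = ["win", "loss", "loss", "win", "loss"]

# labels fixes the row/column order: first "loss", then "win"
toy_cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["loss", "win"])
print(toy_cm)
# [[2 0]   both losses predicted correctly
#  [1 2]]  one win mislabelled as a loss, two wins correct
```

Off-diagonal counts are the model's mistakes, so a near-diagonal matrix corresponds to high accuracy.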
Logistic regression achieves 91.6% accuracy with a standard deviation of 0.2327, which already gives us a solid baseline classification model.
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy',random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
rf_acc = accuracies.mean()
rf_std = accuracies.std()
display(rf_acc)
display(rf_std)
# Extract the feature importances using .feature_importances_
importances = classifier.feature_importances_
# Plot
vs.feature_plot(importances, X_train, y_train)
The random forest reaches 95.0% accuracy with a standard deviation of 0.119: accuracy improved by approximately 3.6 percentage points and the standard deviation dropped by almost half.
# Fitting XGBoost to the Training set
from xgboost import XGBClassifier
classifier = XGBClassifier(random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
xgb_acc = accuracies.mean()
xgb_std = accuracies.std()
display(xgb_acc)
display(xgb_std)
Surprisingly, XGBoost performed worse than the random forest (92.5% accuracy), but it is still a good model.
The top five features by importance are attacker_size, attacker_commander_count, attack_houses, defender_size, and defender_houses.
model_data = {'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
'Accuracy': [logistic_acc, rf_acc, xgb_acc],
'Standard Deviation': [logistic_std,rf_std,xgb_std]
}
model_df = pd.DataFrame(model_data, columns = ['Model','Accuracy','Standard Deviation'])
model_df['Accuracy'] = model_df['Accuracy'].astype(float)
model_df['Standard Deviation']= model_df['Standard Deviation'].astype(float)
model_df = model_df.sort_values('Accuracy',ascending=False)
display(model_df)
chart = sns.catplot(x="Model", y="Accuracy",kind="bar", data=model_df)
I reduced the dataset to the five most important features and plugged it into the Random Forest model again.
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy',random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
display(accuracies.mean())
display(accuracies.std())
# Extract the feature importances using .feature_importances_
importances = classifier.feature_importances_
vs.feature_plot(importances, X_train, y_train)
# Import functionality for cloning a model
from sklearn.base import clone
# Reduce the feature space
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_test_reduced = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:5]]]
# Train on the "best" model found from grid search earlier
clf = classifier.fit(X_train_reduced, y_train)
# Make new predictions
reduced_predictions = clf.predict(X_test_reduced)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, reduced_predictions)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = clf, X = X_train_reduced, y = y_train, cv = 20)
rfr_acc = accuracies.mean()
rfr_std = accuracies.std()
display(rfr_acc)
display(rfr_std)
importances = clf.feature_importances_
vs.feature_plot(importances, X_train_reduced, y_train)
After reducing the features down to the five most important predictors, accuracy improves slightly to 96.6% and the standard deviation decreases slightly to 0.1. We now have a model that can predict battle outcomes for Game of Thrones!
The most important factors are:
attacker_size - The size of the attacking army matters (unless you are Balon/Euron Greyjoy, who wins with smaller armies).
attacker_commander_count - The number of attacking commanders matters as well.
attack_houses - The number of attacking houses in the battle.
defender_size - The size of the defending army matters (again, unless you are Balon/Euron Greyjoy).
defender_houses - The number of defending houses in the battle.
model_data = {'Model': ['Logistic Regression', 'Random Forest', 'XGBoost','Random Forest Reduced'],
'Accuracy': [logistic_acc, rf_acc, xgb_acc, rfr_acc],
'Standard Deviation': [logistic_std, rf_std, xgb_std, rfr_std]
}
model_df = pd.DataFrame(model_data, columns = ['Model','Accuracy','Standard Deviation'])
model_df['Accuracy'] = model_df['Accuracy'].astype(float)
model_df['Standard Deviation']= model_df['Standard Deviation'].astype(float)
model_df = model_df.sort_values('Accuracy',ascending=False)
display(model_df)
chart = sns.catplot(x="Model", y="Accuracy",kind="bar", data=model_df)
display(chart.set_xticklabels(rotation=45, horizontalalignment='right'))
The Random Forest Reduced model improved to 96.6% and the standard deviation improved slightly. It looks like we got our model to predict battle outcomes!
1) Joffrey/Tommen Baratheon wins the most battles against defending armies, while the Greyjoys never actually faced a defending army of nonzero size.
2) An army involved in an ambush or a pitched battle can use the corresponding linear model to estimate the size of the opposing army; the ambush model is more accurate than the pitched-battle model.
3) The Random Forest model can predict battle outcomes with 96.6% accuracy, and the most important factors determining victory are attacker size, attacker commander count, attacking houses, defender size, and defending houses.
